The aim of this analysis is to investigate diabetes prevalence over time. The analysis looks at the variables year, country and sex to identify groups with high diabetes prevalence.
The dataset ‘DIABETES evolution of diabetes over time’ is a global dataset of diabetes prevalence from 1980 to 2014. It contains a total of 14,000 observations and 7 variables.
Table 2.1 below shows the first six observations of the full dataset.
# Read in Data
data_full <- read_csv("Data/Diabetes_data.csv")
# keep the first six observations
data_full_head <- head(data_full)
# display in table
kable(data_full_head,
caption = "First Six Observations of the Full Diabetes Dataset",
digits = 2)
| Country/Region/World | ISO | Sex | Year | Age-standardised diabetes prevalence | Lower 95% uncertainty interval | Upper 95% uncertainty interval |
|---|---|---|---|---|---|---|
| Afghanistan | AFG | Men | 1980 | 0.04 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1981 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1982 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1983 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1984 | 0.05 | 0.02 | 0.09 |
| Afghanistan | AFG | Men | 1985 | 0.05 | 0.02 | 0.09 |
The full dataset was reduced to 1,000 observations by randomly generating row numbers. The variable “ISO” was removed because it was not needed for the analysis. The reduced data has 6 variables (although the limit is 5 variables, I counted the lower and upper 95% uncertainty interval variables as one variable). Figure 3.1 below shows the code used to tidy the full dataset into the reduced dataset.
include_graphics("Image/code_screenshot.png")
Figure 3.1: Code Screenshot of Data Tidying
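Because the tidying code is only available as a screenshot, the steps described above can be sketched roughly as follows. This is an assumed reconstruction, not the code in Figure 3.1: the stand-in `data_full` tibble, the seed value and the use of `sample()` are illustrative choices, and the new column names are taken from the `str()` output of the reduced data.

```r
library(dplyr)
library(tibble)

# Stand-in for data_full so the sketch runs on its own (hypothetical values);
# in the report, data_full comes from read_csv("Data/Diabetes_data.csv").
data_full <- tibble(
  `Country/Region/World` = rep(c("Afghanistan", "Australia"), each = 1000),
  ISO = rep(c("AFG", "AUS"), each = 1000),
  Sex = rep(c("Men", "Women"), times = 1000),
  Year = rep(1980:2004, times = 80),
  `Age-standardised diabetes prevalence` = runif(2000, 0.02, 0.20),
  `Lower 95% uncertainty interval` = runif(2000, 0.01, 0.02),
  `Upper 95% uncertainty interval` = runif(2000, 0.20, 0.30)
)

set.seed(2014)                                # assumed seed, for reproducibility
rows <- sample(nrow(data_full), size = 1000)  # random row numbers, no repeats

# keep the sampled rows, drop ISO and shorten the long column names
data <- data_full[rows, ] %>%
  select(-ISO) %>%
  rename(
    diabetes_prevalence = `Age-standardised diabetes prevalence`,
    lower_95 = `Lower 95% uncertainty interval`,
    upper_95 = `Upper 95% uncertainty interval`
  )
```

Setting a seed before sampling means the same 1,000 rows are drawn each time the report is knitted, so the tables and the text that describes them stay in step.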
Using the function str(), the first 2 rows of the data are displayed to show the type of each variable in the dataset (numeric, character/factor, etc.).
# keep only the first 2 rows, then show their structure
head_data_2 <- head(data,2)
str(head_data_2)
## tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
## $ Country/Region/World: chr [1:2] "Micronesia (Federated States of)" "Pakistan"
## $ Sex : chr [1:2] "Men" "Women"
## $ Year : num [1:2] 2013 1982
## $ diabetes_prevalence : num [1:2] 0.2003 0.0612
## $ lower_95 : num [1:2] 0.124 0.028
## $ upper_95 : num [1:2] 0.293 0.115
Mean and standard deviation were calculated for diabetes prevalence by “Year”. Table 4.1 shows the results of the summary statistics. This section requires grouping by a factor/character variable. ‘Year’ is a numerical variable but was chosen here to better reflect the research question.
# group data by year and create summary statistics
data_summary <- data %>%
group_by(Year) %>%
summarise(mean_diabetes = mean(diabetes_prevalence),
sd_diabetes = sd(diabetes_prevalence),
mean_upper95 = mean(upper_95),
sd_upper95 = sd(upper_95),
mean_lower95 = mean(lower_95),  # mean (not sd) of the lower bound
sd_lower95 = sd(lower_95))
# display only the last 10 rows (latest years)
tail_data_summary <- tail(data_summary, 10)
# create table
kable(tail_data_summary,
caption = "Mean and Standard Deviation of Diabetes Prevalence by Year (Last 10 Years)",
digits = 3)
| Year | mean_diabetes | sd_diabetes | mean_upper95 | sd_upper95 | mean_lower95 | sd_lower95 |
|---|---|---|---|---|---|---|
| 2005 | 0.083 | 0.033 | 0.117 | 0.042 | 0.025 | 0.025 |
| 2006 | 0.085 | 0.046 | 0.120 | 0.058 | 0.036 | 0.036 |
| 2007 | 0.095 | 0.050 | 0.132 | 0.063 | 0.038 | 0.038 |
| 2008 | 0.092 | 0.037 | 0.129 | 0.046 | 0.031 | 0.031 |
| 2009 | 0.078 | 0.026 | 0.116 | 0.035 | 0.020 | 0.020 |
| 2010 | 0.083 | 0.035 | 0.125 | 0.048 | 0.026 | 0.026 |
| 2011 | 0.089 | 0.050 | 0.131 | 0.065 | 0.037 | 0.037 |
| 2012 | 0.104 | 0.061 | 0.156 | 0.082 | 0.043 | 0.043 |
| 2013 | 0.105 | 0.058 | 0.163 | 0.081 | 0.038 | 0.038 |
| 2014 | 0.083 | 0.040 | 0.135 | 0.058 | 0.026 | 0.026 |
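Table 4.1 does not show a steady increase in mean diabetes prevalence from 2005 to 2014: the mean fluctuates between 7.8% and 10.5%, and 2005 and 2014 have the same mean of 8.3%. Within this period, 2013 had the highest mean diabetes prevalence at 10.5%, closely followed by 2012 at 10.4%, which also had the highest standard deviation (0.061). 2009 had both the lowest mean (7.8%) and the lowest standard deviation (0.026).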
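Statements about which year has the highest or lowest values can be checked directly against the summary table. A minimal sketch, assuming the mean and standard deviation columns printed in Table 4.1 are re-entered by hand (in the report itself, `tail_data_summary` already holds them):

```r
library(dplyr)
library(tibble)

# mean and sd of diabetes prevalence copied from Table 4.1
table_4_1 <- tibble(
  Year = 2005:2014,
  mean_diabetes = c(0.083, 0.085, 0.095, 0.092, 0.078,
                    0.083, 0.089, 0.104, 0.105, 0.083),
  sd_diabetes   = c(0.033, 0.046, 0.050, 0.037, 0.026,
                    0.035, 0.050, 0.061, 0.058, 0.040)
)

# year with the highest mean, the lowest mean and the highest sd
highest_mean <- table_4_1 %>% slice_max(mean_diabetes, n = 1)
lowest_mean  <- table_4_1 %>% slice_min(mean_diabetes, n = 1)
highest_sd   <- table_4_1 %>% slice_max(sd_diabetes, n = 1)

highest_mean$Year
lowest_mean$Year
highest_sd$Year
```

Running a check like this each time the random sample changes helps keep the written interpretation consistent with the table it describes.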
A figure was created using the ggplot2 R package with a geom_point() layer. This is displayed in Figure 5.1:
Figure_2 <- ggplot(data = data_summary, aes(x = Year, y = mean_diabetes)) +
geom_point(alpha = 0.7) +
xlab("Year") +
ylab("Mean Diabetes Prevalence") +
theme_minimal() +
geom_smooth() +
geom_errorbar(aes(ymin=mean_diabetes-sd_diabetes, ymax=mean_diabetes+sd_diabetes), colour="red", alpha=0.3)
ggplotly(Figure_2)
Figure 5.1: Mean Diabetes Prevalence Increases Over Time
# first filter for Australian data
Australia_summary <- data_full %>%
filter(`Country/Region/World` == "Australia")
Figure_3 <- ggplot(data = Australia_summary, aes(x = Year, y = `Age-standardised diabetes prevalence`, col = Sex)) +
geom_point(alpha = 0.8) +
xlab("Year") +
ylab("Age-standardised Diabetes Prevalence") +
theme_minimal() +
geom_smooth()
Figure_3
Figure 5.2: Men have Higher Risk of Diabetes
Figure 5.2 shows a trend of increasing diabetes prevalence over time in Australia. Men have a noticeably higher prevalence than women. There is a steep increase from 1980 to 2000 and then a plateau. Data were only available up to 2014, so it is unknown whether the plateau begins to trend downwards.
# filter for five countries, then group by country and sex
Australia_table_summary <- data_full %>%
filter(`Country/Region/World` %in% c("Australia", "Germany", "China", "South Africa", "United States of America")) %>%
select(-ISO) %>%
group_by(`Country/Region/World`, Sex) %>%
# mean over all years (1980 to 2014); drop grouping afterwards
summarise(`Mean diabetes prevalence` = mean(`Age-standardised diabetes prevalence`),
.groups = "drop")
kable(Australia_table_summary,
caption = "Mean Diabetes Prevalence by Country and Sex",
digits = 3)
| Country/Region/World | Sex | Mean diabetes prevalence |
|---|---|---|
| Australia | Men | 0.064 |
| Australia | Women | 0.047 |
| China | Men | 0.060 |
| China | Women | 0.061 |
| Germany | Men | 0.056 |
| Germany | Women | 0.040 |
| South Africa | Men | 0.069 |
| South Africa | Women | 0.097 |
| United States of America | Men | 0.065 |
| United States of America | Women | 0.054 |
Five countries were selected to compare mean diabetes prevalence by country and sex. In Australia, Germany and the United States of America, men have a higher mean diabetes prevalence than women. Mean diabetes prevalence for men and women in China is very similar, with women being 0.001 higher. Interestingly, women in South Africa have a higher mean diabetes prevalence than men.
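The comparison between men and women can be made explicit by computing the difference per country. A minimal sketch, assuming the values from the table above are re-entered by hand; the names `country_sex_means`, `sex_gap` and `men_minus_women` are illustrative and do not appear in the original analysis:

```r
library(dplyr)
library(tidyr)
library(tibble)

# mean diabetes prevalence by country and sex, copied from the table above
country_sex_means <- tibble(
  country = rep(c("Australia", "China", "Germany", "South Africa",
                  "United States of America"), each = 2),
  Sex = rep(c("Men", "Women"), times = 5),
  mean_prevalence = c(0.064, 0.047, 0.060, 0.061, 0.056,
                      0.040, 0.069, 0.097, 0.065, 0.054)
)

# one row per country, with the men-minus-women gap, largest gap first
sex_gap <- country_sex_means %>%
  pivot_wider(names_from = Sex, values_from = mean_prevalence) %>%
  mutate(men_minus_women = Men - Women) %>%
  arrange(desc(men_minus_women))

sex_gap
```

A positive `men_minus_women` means men have the higher mean prevalence; a negative value means women do.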